I was looking for sample code to get my hands dirty with recommender systems, particularly collaborative filtering. I came across a blog post with the same title as this notebook, which is based on Toby Segaran's book Programming Collective Intelligence.
This notebook is based on the original blog post. The code had to be adjusted to work with Python 3 and the latest version of pandas.
In [15]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
The file contains ratings from different critics on various titles.
In [19]:
rating = pd.read_csv('data/movie_rating.csv')
rating.head()
Out[19]:
We will first create a matrix with movie titles as rows and critics as columns. Each cell contains the rating that the corresponding critic gave to that title, or NaN if the critic has not rated it.
In [22]:
rp = rating.pivot_table(index=['title'], columns=['critic'], values='rating')
rp
Out[22]:
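To see what the pivot does without the CSV file, here is a minimal sketch on made-up data. The critics, titles, and ratings below are hypothetical and are not taken from `movie_rating.csv`; only the `pivot_table()` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy ratings in long format: one row per (critic, title) pair.
toy = pd.DataFrame({
    'critic': ['Ann', 'Ann', 'Bob', 'Bob', 'Cy'],
    'title': ['Alpha', 'Beta', 'Alpha', 'Gamma', 'Beta'],
    'rating': [3.0, 4.5, 2.0, 5.0, 1.5],
})

# Titles become rows and critics become columns.
# A (title, critic) pair with no rating shows up as NaN.
toy_rp = toy.pivot_table(index='title', columns='critic', values='rating')
print(toy_rp)
```

Note that `pivot_table()` averages duplicate (title, critic) pairs by default, so a critic who rated the same title twice would end up with the mean of the two ratings.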
The next step is to find the similarity score between the critics. We will use Toby as an example and use the Pearson correlation score. pandas provides the function corrwith(), which computes the correlation of each column with a given Series. As you can see from the result below, Toby's taste is similar to Lisa Rose's but not so much to Gene Seymour's.
Note that we could have used some other similarity metric such as cosine similarity.
In [37]:
rating_toby = rp['Toby']
sim_toby = rp.corrwith(rating_toby)
sim_toby
Out[37]:
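As a rough illustration of the cosine similarity alternative mentioned above, the sketch below compares two critics over the titles both have rated. The helper `cosine_similarity` and the toy matrix are assumptions made for this example; they are not part of the original post or of pandas.

```python
import numpy as np
import pandas as pd

def cosine_similarity(a: pd.Series, b: pd.Series) -> float:
    """Cosine similarity over the titles that both critics have rated."""
    both = a.notna() & b.notna()
    x, y = a[both].to_numpy(), b[both].to_numpy()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    # No shared titles (or all-zero ratings) leaves the similarity undefined.
    return float(x @ y / denom) if denom else float('nan')

# Hypothetical title-by-critic matrix, shaped like rp above.
toy_rp = pd.DataFrame(
    {'Ann': [3.0, 4.0, np.nan], 'Bob': [4.0, 3.0, 5.0], 'Cy': [4.0, np.nan, 1.0]},
    index=pd.Index(['Alpha', 'Beta', 'Gamma'], name='title'),
)

# Similarity of every critic to Ann, analogous to rp.corrwith(rating_toby).
sim_ann = toy_rp.apply(lambda col: cosine_similarity(toy_rp['Ann'], col))
print(sim_ann)
```

Unlike Pearson correlation, cosine similarity does not subtract each critic's mean, so two critics who share only one title (Ann and Cy here) always score 1.0. Whether that is acceptable depends on the data.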
To make recommendations for Toby, we calculate the other critics' ratings weighted by their similarity to Toby. Note that we only need to calculate ratings for movies Toby has not yet seen. The first line below filters out irrelevant data: it keeps only rows for titles Toby has not rated and drops Toby's own rows. The following lines then assign the similarity score and the weighted rating to each remaining row.
In [75]:
rating_c = rating.loc[rating_toby[rating.title].isnull().values & (rating.critic != 'Toby')]
rating_c_similarity = rating_c['critic'].map(sim_toby)
rating_c = rating_c.assign(similarity=rating_c_similarity, sim_rating=rating_c_similarity * rating_c.rating)
rating_c.head()
Out[75]:
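The filtering line is fairly dense, so here is a small sketch of the same idea on made-up data. The critic `Zed`, the titles, and the similarity values are all hypothetical; the point is only to show how the label lookup produces a row mask and how `map()` attaches a similarity to each row.

```python
import numpy as np
import pandas as pd

# Made-up data: the target critic's ratings by title (NaN = not seen).
target = pd.Series({'Alpha': 4.0, 'Beta': np.nan}, name='Zed')
# Made-up similarity of the other critics to the target.
sim = pd.Series({'Ann': 0.9, 'Bob': 0.4})

toy = pd.DataFrame({
    'critic': ['Ann', 'Ann', 'Bob', 'Zed'],
    'title': ['Alpha', 'Beta', 'Beta', 'Alpha'],
    'rating': [3.0, 5.0, 2.0, 4.0],
})

# Look up the target's rating for each row's title; NaN marks an unseen title.
unseen = target[toy['title']].isnull().values
# Keep rows for unseen titles that were not written by the target critic.
mask = unseen & (toy['critic'] != 'Zed').values
toy_c = toy.loc[mask]

# Attach each critic's similarity to the rows they authored.
toy_c = toy_c.assign(similarity=toy_c['critic'].map(sim))
print(toy_c)
```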
Lastly, we add up the weighted scores for each title using groupby(). We also normalize each score by dividing it by the sum of the weights, so that a title rated by more critics does not get a higher score just for that reason. Based on the other critics' similarity and their ratings, we have made a movie recommendation for Toby. The numbers match the results in the book.
In [25]:
recommendations = rating_c.groupby('title').apply(lambda s: s.sim_rating.sum() / s.similarity.sum())
recommendations.sort_values(ascending=False)  # highest predicted rating first
Out[25]:
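The weighted average can be checked by hand on a tiny example. The numbers below are invented for illustration; the sketch also sums the two columns directly rather than using `apply()` with a lambda, which is one possible alternative and not necessarily what the original post intended.

```python
import pandas as pd

# Made-up similarities and ratings for two titles the target has not seen.
toy = pd.DataFrame({
    'title': ['Alpha', 'Alpha', 'Beta', 'Beta'],
    'critic': ['Ann', 'Bob', 'Ann', 'Bob'],
    'rating': [4.0, 2.0, 2.0, 5.0],
    'similarity': [0.75, 0.25, 0.75, 0.25],
})
toy['sim_rating'] = toy['similarity'] * toy['rating']

# Per title: sum of weighted ratings divided by sum of weights.
sums = toy.groupby('title')[['sim_rating', 'similarity']].sum()
toy_recs = (sums['sim_rating'] / sums['similarity']).sort_values(ascending=False)
print(toy_recs)
```

For Alpha this is (0.75 * 4 + 0.25 * 2) / (0.75 + 0.25) = 3.5, and for Beta it is (0.75 * 2 + 0.25 * 5) / 1.0 = 2.75, so the more similar critic's opinion dominates.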
Putting it all together:
In [36]:
rating = pd.read_csv('data/movie_rating.csv')
rp = rating.pivot_table(index=['title'], columns=['critic'], values='rating')
rating_toby = rp['Toby']
sim_toby = rp.corrwith(rating_toby)
rating_c = rating.loc[rating_toby[rating.title].isnull().values & (rating.critic != 'Toby')]
rating_c_similarity = rating_c['critic'].map(sim_toby)
rating_c = rating_c.assign(similarity=rating_c_similarity, sim_rating=rating_c_similarity * rating_c.rating)
recommendations = rating_c.groupby('title').apply(lambda s: s.sim_rating.sum() / s.similarity.sum())
recommendations.sort_values(ascending=False)  # highest predicted rating first
Out[36]:
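The same steps could be wrapped in a function so they can be run for any critic. This is only a sketch: the name `recommend_for` and the toy data are assumptions made for this example, and it uses `isin()` for the filter and a column-wise sum for the aggregation instead of the exact expressions above.

```python
import pandas as pd

def recommend_for(rating: pd.DataFrame, critic: str) -> pd.Series:
    """Predict ratings for titles `critic` has not rated (hypothetical helper)."""
    rp = rating.pivot_table(index='title', columns='critic', values='rating')
    target = rp[critic]
    sim = rp.corrwith(target)               # Pearson similarity to the target
    unseen = target[target.isna()].index    # titles the target has not rated
    cand = rating[rating['title'].isin(unseen) & (rating['critic'] != critic)]
    cand = cand.assign(similarity=cand['critic'].map(sim))
    cand = cand.assign(sim_rating=cand['similarity'] * cand['rating'])
    sums = cand.groupby('title')[['sim_rating', 'similarity']].sum()
    return (sums['sim_rating'] / sums['similarity']).sort_values(ascending=False)

# Made-up ratings: Toby has not rated title 'D'.
toy = pd.DataFrame({
    'critic': ['Toby'] * 3 + ['Ann'] * 4 + ['Bob'] * 4,
    'title': ['A', 'B', 'C'] + ['A', 'B', 'C', 'D'] * 2,
    'rating': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 3.0, 2.0],
})
toy_result = recommend_for(toy, 'Toby')
print(toy_result)
```

Like the cells above, this sketch keeps critics with negative similarity in the weighted average. If that is not wanted, one option is to filter `cand` to rows with `similarity > 0` before summing.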